A Japanese Particle Corpus Built by Example-Based Annotation

نویسندگان

  • Hiroki Hanaoka
  • Hideki Mima
  • Jun'ichi Tsujii
چکیده

This paper is a report on an on-going project of creating a new corpus focusing on Japanese particles. The corpus will provide deeper syntactic/semantic information than the existing resources. The initial target particle is to which occurs 22,006 times in 38,400 sentences of the existing corpus: the Kyoto Text Corpus. In this annotation task, an “example-based” methodology is adopted for the corpus annotation, which is different from the traditional annotation style. This approach provides the annotators with an example sentence rather than a linguistic category label. By avoiding linguistic technical terms, it is expected that any native speakers, with no special knowledge on linguistic analysis, can be an annotator without long training, and hence it can reduce the annotation cost. So far, 10,475 occurrences have been already annotated, with an inter-annotator agreement of 0.66 calculated by Cohen’s kappa. The initial disagreement analyses and future directions are discussed in the paper.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Japanese Dialogue Corpus of Multi-Level Annotation

This paper describes a Japanese dialogue corpus annotated with multi-level information built by the Japanese Discourse Research Initiative, Japanese Society for Artificial Intelligence. The annotation information consists of speech, transcription delimited by slash units, prosodic, part of speech, dialogue acts and dialogue segmentation. In the project, we used the corpus for obtaining new find...

متن کامل

A Japanese Word Dependency Corpus

In this paper, we present a corpus annotated with dependency relationships in Japanese. It contains about 30 thousand sentences in various domains. Six domains in Balanced Corpus of Contemporary Written Japanese have part-of-speech and pronunciation annotation as well. Dictionary example sentences have pronunciation annotation and cover basic vocabulary in Japanese with English sentence equival...

متن کامل

The Japanese FrameNet Software Tools

This paper describes an ongoing project “Japanese FrameNet (JFN)”, a corpus-based lexicon of Japanese in the FrameNet style. This paper focuses on the set of software tools tailored for the JFN annotation process. As the first step in the annotation, annotators select target sentences from the JFN corpus using the JFN kwic search tool, where they can specify cooccurring words and/or the part of...

متن کامل

Word Alignment Annotation in a Japanese-Chinese Parallel Corpus

Parallel corpora are critical resources for machine translation research and development since parallel corpora contain translation equivalences of various granularities. Manual annotation of word alignment is of significance to provide gold-standard for developing and evaluating both example-based machine translation model and statistical machine translation model. This paper presents the work...

متن کامل

Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and its Application

In Japanese, the syntactic structure of a sentence is generally represented by the relationship between phrasal units, bunsetsus in Japanese, based on a dependency grammar. In many cases, the syntactic structure of a bunsetsu is not considered in syntactic structure annotation. This paper gives the criteria and definitions of dependency relationships between words in a bunsetsu and their applic...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010